Chapter 5 Parts-of-Speech Tagging

In many textual analyses, word classes can give us additional information about the text we analyze. These word classes typically are referred to as parts-of-speech tags of the words. In this chapter, we will show you how to POS tag a raw-text corpus to get the syntactic categories of words, and what to do with those POS tags.

In particular, I will introduce a powerful package, spacyr, which is an R wrapper to spaCy, the “industrial strength natural language processing” Python library from https://spacy.io. In addition to POS tagging, the package provides other linguistically relevant annotations for more in-depth analysis of English texts.

Again, spaCy is optimized for many languages, but not for Chinese. We will talk about Chinese text processing in a later chapter.

5.1 Installing the Package

Please consult the spacyr github for more instructions on installing the package.

There are at least four steps:

  1. Install miniconda (or any other conda version for Python)

  2. Install the spacyr R package

  3. Because spacyr is an R wrapper to the Python package spaCy, we now need to install the Python module (and the language model files) as well.

The easiest way to install Python spaCy is to install it from RStudio through the R function spacyr::spacy_install(). By default, this function creates a new conda environment called spacy_condaenv, as long as some version of conda has been installed on the user’s system.

Please also note that spacyr uses Python 3.6.x and spaCy 2.2.3+.

spacy_install() creates a stand-alone conda environment that includes a Python executable separate from your system Python (or Anaconda Python), installs the latest version of spaCy (and its required packages), and downloads the English language model.

Step 1 is very important. If you don’t have any conda version installed on your system, you can install miniconda from https://conda.io/miniconda.html. (Choose the 64-bit version.) Also, for Mac users, spacy_install() will automatically install miniconda if no conda is found on the system.

Windows users may need to consult the spacyr github for more important instructions on installation.

For Windows, you need to run RStudio as an administrator to make installation work properly. To do so, right click the RStudio icon (or R desktop icon) and select “Run as administrator” when launching RStudio.

  4. Restart R and initialize spaCy in R
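The steps above can be sketched in R roughly as follows. This is only a setup fragment under stated assumptions: the model name "en_core_web_sm" is my assumption for the English model, and the defaults of spacy_install() differ across spacyr versions, so please check the package documentation for the version you have.

```r
# --- One-time setup (steps 2 and 3); run once, then restart R ---
# install.packages("spacyr")   # step 2: install the R package from CRAN
library(spacyr)
spacy_install()                # step 3: install Python spaCy and a language model

# --- Step 4: in every new R session ---
library(spacyr)
# The model name is an assumption; use the model you actually installed.
spacy_initialize(model = "en_core_web_sm")
```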

5.2 Quick Overview

spacyr provides a useful function, spacy_parse(), which allows us to parse an English text in a very convenient way.

The output parsedtext is a data frame, which includes annotations of the original texts at multiple granularities.

  • All texts have been tokenized into words, with each word, sentence, and text given a unique ID (i.e., doc_id, sentence_id, token_id)
  • Lemmatization is also done (i.e., lemma)
  • POS tags can also be found (i.e., pos and tag)
  • Depending on the argument settings for spacy_parse(), you can get more annotations, such as named entities (entity) and dependency relations (dep_rel).
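As a minimal sketch of how such a data frame could be produced, the call below parses two short texts and asks for all of the annotations listed above. The example sentences, the object name txt, and the model name are my own assumptions for illustration; the code also needs a working spaCy installation to run.

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")  # model name is an assumption

# Two short example texts (made up for illustration)
txt <- c(d1 = "spaCy is great at fast natural language processing.",
         d2 = "Mr. Smith spent two years in North Carolina.")

# Request coarse POS (pos), fine-grained POS (tag), lemmas,
# named entities, and dependency relations
parsedtext <- spacy_parse(txt,
                          pos = TRUE, tag = TRUE, lemma = TRUE,
                          entity = TRUE, dependency = TRUE)

# One row per token, with doc_id, sentence_id, and token_id as identifiers
head(parsedtext)
```

With dependency = TRUE, the output should also include the columns head_token_id and dep_rel.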

5.3 Working Pipeline

In Chapter 4, we provided a primitive working pipeline for text analytics. Here we would like to revise the workflow to satisfy different goals in computational text analytics (see Figure 5.1).

After we secure a collection of raw texts as our corpus, if we do not need additional parts-of-speech information, we follow the workflow on the right.

If we need additional annotations from spacyr, we follow the workflow on the left.

Figure 5.1: English Text Analytics Flowchart

5.4 Parsing Your Texts

Now let’s use this spacy_parse() to analyze the presidential addresses we’ve seen in Chapter 4: the data_corpus_inaugural from quanteda.

To illustrate the annotation more clearly, let’s parse the first text in data_corpus_inaugural:

We can parse the whole corpus collection as well: we first apply spacy_parse() to each text in data_corpus_inaugural using map(), and then combine the resulting data frames into one by passing rbind() to do.call().

##    user  system elapsed 
##  22.029   2.065  24.154

The function system.time() reports the CPU time (and the elapsed time) used by the expression in the parentheses. In other words, you can pass any R expression to system.time() as its argument and measure the time required to evaluate it.

This is sometimes necessary because some data processing can be very time-consuming, and we would like to know HOW time-consuming it is in case we need to run the procedure again.
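Here is a small, self-contained illustration of system.time() on an arbitrary expression. The expression being timed and the object names are made up for the demonstration.

```r
# Time an arbitrary expression: summing the square roots of 1..1,000,000
timing <- system.time({
  total <- sum(sqrt(seq_len(1e6)))
})

# Prints the user, system, and elapsed times in seconds
print(timing)

# Individual components can be extracted by name
elapsed_seconds <- timing[["elapsed"]]
```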

Before we move on, we need to clean up the doc_id column of corp_us_words. We lost the document IDs when we used map().

Now the document ID information is in the row names of corp_us_words. So we retrieve the document filenames from the row names and use them as the doc_id.
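To see why the document names end up in the row names, here is a minimal base-R sketch. It is not the code used for corp_us_words: toy_parse() is a hypothetical stand-in for spacy_parse(), lapply() stands in for map(), and the two texts are made up.

```r
# Hypothetical stand-in for spacy_parse(): one row per whitespace token.
# Like a parser called on a single text, it always labels the document "text1".
toy_parse <- function(txt) {
  tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]
  data.frame(doc_id = "text1",
             token_id = seq_along(tokens),
             token = tokens,
             stringsAsFactors = FALSE)
}

texts <- c("1789-Washington" = "Fellow citizens of the Senate",
           "1793-Washington" = "I am again called upon")

parsed_list <- lapply(texts, toy_parse)     # named list of data frames
corp_words  <- do.call(rbind, parsed_list)  # row names: "1789-Washington.1", ...

# Recover doc_id by dropping the ".<row number>" suffix from the row names
corp_words$doc_id <- sub("\\.[0-9]+$", "", rownames(corp_words))
rownames(corp_words) <- NULL
```

Because rbind() builds the row names from the names of the list elements, removing the numeric suffix gives back the original document names.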


Exercise 5.1 In corpus linguistics analysis, we often need to examine constructions on a sentential level. It would be great if we can transform the word-based data frame into a sentence-based one for more efficient later analysis. Also, on the sentential level, it would be great if we can preserve the information of the lexical POS tags. How can you transform the corp_us_words into one as provided below? (You may name the sentence-based data frame as corp_us_sents.)

5.5 Metalinguistic Analysis

Now spacy_parse() has enriched our corpus data with more linguistic annotations. We can now utilize the additional POS tags for more analysis.

In many applied linguistics studies, people sometimes look at the syntactic complexity of the language across a particular factor. For example, people may look at the development of syntactic complexity in L2 learners of varying proficiency levels, in L1 speakers at different acquisition stages, or in writers of different genres (e.g., academic vs. nonacademic).

To operationalize the construct of syntactic complexity, we use a simple metric, Fichtner's C, which is defined as:

\[ Fichtner's\;C = \frac{Number\;of\;Verbs}{Number\;of\;Sentences} \times \frac{Number\;of\;Words}{Number\;of\;Sentences} \]

Now we can take the corp_us_words and first generate the frequencies of verbs, and number of words for each presidential speech text.
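As a minimal sketch of this computation, the code below applies the formula to a toy token-level data frame in base R. The data and the helper fichtner_c() are made up for illustration, and I assume here that "words" means all tokens whose pos is not PUNCT; the actual analysis may count words differently.

```r
# Toy token-level data with the same key columns as a parsed corpus
toy_words <- data.frame(
  doc_id      = c(rep("A", 8), rep("B", 6)),
  sentence_id = c(1, 1, 1, 1, 2, 2, 2, 2,  1, 1, 1, 1, 1, 1),
  pos         = c("PRON", "VERB", "NOUN", "PUNCT",
                  "PRON", "VERB", "VERB", "PUNCT",
                  "DET", "NOUN", "VERB", "ADP", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Fichtner's C for the tokens of one document
fichtner_c <- function(d) {
  n_sent <- length(unique(d$sentence_id))  # number of sentences
  n_verb <- sum(d$pos == "VERB")           # number of verbs
  n_word <- sum(d$pos != "PUNCT")          # number of words (assumption: no punctuation)
  (n_verb / n_sent) * (n_word / n_sent)
}

# One score per document
scores <- sapply(split(toy_words, toy_words$doc_id), fichtner_c)
scores
```

In this toy example, document A has 3 verbs and 6 words in 2 sentences, so its score is (3/2) × (6/2) = 4.5; document B has 1 verb and 5 words in 1 sentence, so its score is 5.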

With the syntactic complexity of each president, we can plot the tendency:

It’s interesting to see a decreasing trend in syntactic complexity!


Exercise 5.2 Please add a regression/smooth line to the above plot to indicate the downward trend.


5.6 Construction Analysis

Now with parts-of-speech tags, we are able to look at more linguistic patterns or constructions in detail. These POS tags allow us to extract more precisely the target patterns we are interested in.

In this section, we will use the output from Exercise 5.1. We assume that we now have a sentence-based corpus data frame, corp_us_sents. Here I would like to provide a case study on English Preposition Phrases.

We can use regular expressions to extract PREP + NOUN combinations from the corpus data.

In the above example, we specify the token= argument in unnest_tokens(..., token = ...) with a self-defined function. The idea of tokenization in unnest_tokens() is that the token argument should be a function which takes a text-based vector as input (i.e., each element of the input vector may be a document text) and returns a list, each element of which is a token-based version (i.e., a vector of tokens) of the corresponding input element (cf. Figure 5.2).

Figure 5.2: Intuition for token= in unnest_tokens()

In our demonstration, we define a tokenization function, which takes sentence_tag as the input and returns a list, each element of which consists of a vector of the tokens matching the regular expression in an individual sentence in sentence_tag. (Note: The function is anonymous. It is not assigned to an object name, so no function object is created in the R working session.)
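The input/output contract of such a tokenization function can be sketched in base R as follows. This is only an illustration under my own assumptions: the function name extract_prep_noun(), the regular expression, and the two toy sentences are hypothetical, and I assume a lowercased word/tag format such as with/adp.

```r
# Hypothetical tokenization function: takes a character vector of tagged
# sentences and returns a list with one character vector of matches per sentence
extract_prep_noun <- function(x) {
  # A preposition (tag adp) immediately followed by a noun (tag noun)
  regmatches(x, gregexpr("[^/ ]+/adp [^/ ]+/noun", x))
}

# Toy input in word/tag format (made up for illustration)
sentence_tag <- c(
  "i/pron speak/verb with/adp confidence/noun of/adp the/det people/noun ./punct",
  "we/pron agree/verb ./punct"
)

pp_list <- extract_prep_noun(sentence_tag)
pp_list
```

The first sentence yields one match, with/adp confidence/noun, while of/adp the/det people/noun is missed because a determiner intervenes; the second sentence yields an empty vector. A function with this shape (vector in, list of vectors out) is what the token argument expects.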


Exercise 5.3 Create a new column, pat_clean, with all annotations removed in the data frame result_pat1.

With these constructional tokens of English PP’s, we can then do further analysis.

  1. We first identify the PREP and NOUN for each constructional token.
  2. We then clean up the data by removing POS annotations.
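These two steps can be sketched with base-R regular expressions. The vector pat and the patterns below are hypothetical examples in the word/tag format, not the code used to produce the results in this chapter.

```r
# Toy constructional tokens in word/tag format (made up for illustration)
pat <- c("with/adp confidence/noun",
         "in/adp presence/noun",
         "of/adp the/det people/noun")

# Step 1 and 2 together: capture the word form and drop its POS annotation
PREP <- sub("^([^/ ]+)/adp .*$", "\\1", pat)   # word before the first /adp
NOUN <- sub("^.* ([^/ ]+)/noun$", "\\1", pat)  # word before the final /noun

data.frame(pat, PREP, NOUN, stringsAsFactors = FALSE)
```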

Now we are ready to explore the text data.

  • We can look at how each preposition is being used by different presidents:
  • We can examine the most frequent NOUN that co-occurs with each PREP:
  • We can also look at a more complex usage pattern: how each president uses the PREP of in terms of its co-occurring NOUNs:
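The three explorations above can be sketched with base-R frequency tables. The data frame pp_tokens is a made-up toy example; its column names follow the PREP and NOUN columns described earlier, and the counts are not real results.

```r
# Toy PP tokens (made up for illustration)
pp_tokens <- data.frame(
  doc_id = c("1789-Washington", "1789-Washington",
             "1793-Washington", "1793-Washington", "1793-Washington"),
  PREP   = c("of", "of", "of", "in", "in"),
  NOUN   = c("people", "government", "people", "presence", "presence"),
  stringsAsFactors = FALSE
)

# (a) How often each president uses each preposition
prep_by_doc <- table(pp_tokens$doc_id, pp_tokens$PREP)

# (b) The most frequent NOUN co-occurring with each PREP
pair_freq <- as.data.frame(table(PREP = pp_tokens$PREP, NOUN = pp_tokens$NOUN),
                           stringsAsFactors = FALSE)
pair_freq <- pair_freq[pair_freq$Freq > 0, ]
top_noun  <- do.call(rbind, lapply(split(pair_freq, pair_freq$PREP),
                                   function(d) d[which.max(d$Freq), ]))

# (c) NOUNs co-occurring with the PREP "of", by president
of_tokens <- pp_tokens[pp_tokens$PREP == "of", ]
of_by_doc <- table(of_tokens$doc_id, of_tokens$NOUN)
```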

Exercise 5.4 In our earlier demonstration, we made a naive assumption: Preposition Phrases include only those cases where PREP and NOUN are adjacent to each other. But there are many more tokens where words do come between the PREP and the NOUN (e.g., with greater anxieties, by your order). Please revise the regular expression to improve the retrieval of the English Preposition Phrases from the corpus data corp_us_sents. Specifically, we can define an English PP as a sequence of words that starts with a preposition and ends at the first word after the preposition that is tagged as NOUN, PROPN, or PRON.

Exercise 5.5 Based on the output from Exercise 5.4, please identify the PREP and NOUN for each constructional token and save the information in two new columns.

5.7 Issues on Pattern Retrieval

Any automatic pattern retrieval comes with a price: there are always errors returned by the system.

I would like to discuss this issue based on the second text, 1793-Washington. First let’s take a look at the Preposition Phrases extracted by the regular expression I used in Exercises 5.4 and 5.5:

My regular expression has identified 20 PP’s from the text. However, if we go through the text carefully and do the PP annotation manually, we may have different results.


Figure 5.3: Manual Annotation of English PP’s in 1793-Washington


There are two types of errors:

  • False Positives: patterns identified by the system that are in fact not true patterns.
  • False Negatives: true patterns in the data that the system fails to identify.

As shown in Figure 5.3, manual annotations have identified 21 PP’s from the text while the regular expression identified 20 tokens. A comparison of the two results shows that:

  • In the regex result, the following returned tokens (rows highlighted in red) are False Positives: the regular expression identified them as PP’s, but they are NOT PP’s according to the manual annotations.

| doc_id | sentence_id | PREP | NOUN | pat_pp | row_id |
|---|---|---|---|---|---|
| 1793-Washington | 1 | by | voice | by/adp the/det voice/noun | 1 |
| 1793-Washington | 1 | of | country | of/adp my/det country/noun | 2 |
| 1793-Washington | 1 | of | chief | of/adp its/det chief/propn | 3 |
| 1793-Washington | 2 | for | it | for/adp it/pron | 4 |
| 1793-Washington | 2 | of | honor | of/adp this/det distinguished/adj honor/noun | 5 |
| 1793-Washington | 2 | of | confidence | of/adp the/det confidence/noun | 6 |
| 1793-Washington | 2 | in | me | in/adp me/pron | 7 |
| 1793-Washington | 2 | by | people | by/adp the/det people/noun | 8 |
| 1793-Washington | 2 | of | united | of/adp united/propn | 9 |
| 1793-Washington | 3 | to | execution | to/adp the/det execution/noun | 10 |
| 1793-Washington | 3 | of | act | of/adp any/det official/adj act/noun | 11 |
| 1793-Washington | 3 | of | president | of/adp the/det president/propn | 12 |
| 1793-Washington | 3 | of | office | of/adp office/noun | 13 |
| 1793-Washington | 4 | in | presence | in/adp your/det presence/noun | 14 |
| 1793-Washington | 4 | during | administration | during/adp my/det administration/noun | 15 |
| 1793-Washington | 4 | of | government | of/adp the/det government/propn | 16 |
| 1793-Washington | 4 | in | instance | in/adp any/det instance/noun | 17 |
| 1793-Washington | 5 | to | upbraidings | to/adp the/det upbraidings/noun | 18 |
| 1793-Washington | 5 | of | who | of/adp all/det who/pron | 19 |
| 1793-Washington | 5 | of | ceremony | of/adp the/det present/adj solemn/adj ceremony/noun | 20 |

  • In the above manual annotation (Figure 5.3), phrases highlighted in red are NOT successfully identified by the current regex query, i.e., False Negatives.

We can summarize the pattern retrieval results as:



Most importantly, we can describe the quality of the pattern retrieval with two important measures.

  • \(Precision = \frac{True\;Positives}{True\;Positives + False\;Positives}\)
  • \(Recall = \frac{True\;Positives}{True\;Positives + False\;Negatives}\)

In our case:

  • \(Precision = \frac{18}{18+2} = 90\%\)
  • \(Recall = \frac{18}{18 + 3} = 85.71\%\)
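The same arithmetic can be reproduced in R. The counts are read off the fractions above (18 true positives, 2 false positives, 3 false negatives); the object names are my own.

```r
# Counts taken from the comparison of regex output and manual annotation
true_pos  <- 18  # tokens found by the regex that are real PP's
false_pos <- 2   # tokens found by the regex that are not PP's
false_neg <- 3   # real PP's that the regex missed

precision <- true_pos / (true_pos + false_pos)
recall    <- true_pos / (true_pos + false_neg)

# Report as percentages, rounded to two decimals
round(100 * c(precision = precision, recall = recall), 2)
```

The denominators correspond to the 20 tokens returned by the regular expression (18 + 2) and the 21 PP’s found by manual annotation (18 + 3).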

It is always very difficult to reach 100% precision or 100% recall for automatic retrieval of the target patterns. Researchers often need to make a compromise. The following are some heuristics based on my experiences:

  1. For small datasets, probably manual annotations give the best result.
  2. For moderate-sized datasets, semi-automatic annotations may help. Do the automatic annotations first and follow up with manual checkups.
  3. For large datasets, automatic annotations are preferred in order to examine the general tendency. However, it is always good to have a random sample of the data to check the query performance.
  4. The more semantics-related the annotations, the more likely one would adopt a manual approach to annotation (e.g., conceptual metaphors, sense distinctions, dialogue acts).
  5. For common annotations of corpus data, an automatic approach is often preferred, e.g., Chinese word segmentation, POS tagging, named entity recognition, chunking, noun-phrase extraction, or dependency relations(?).

5.8 Saving POS-tagged Texts

We often return to our corpus texts again and again when we explore the data. To avoid re-tagging the texts every time we analyze the data, it is more convenient to save the tokenized texts with their POS tags in external files. Next time we can load these files directly without going through the POS tagging again.

However, when saving the POS-tagged results to an external file, it is highly recommended to keep all the tokens of the original texts. That is, leave all the word tokens as well as the non-word tokens intact.

A few suggestions:

  1. If you are dealing with a small corpus, I would suggest saving the resulting data frame from spacy_parse() as a csv for later use.
  2. If you are dealing with a big corpus, I would suggest saving the parsed output of each text file in a separate csv for later use.
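A minimal sketch of the first suggestion, using base R only. The small data frame parsed_demo stands in for the output of spacy_parse(), and the file name and temporary directory are arbitrary choices for the demonstration.

```r
# A toy stand-in for a parsed data frame; note that the punctuation
# token is kept, in line with the advice to leave all tokens intact
parsed_demo <- data.frame(
  doc_id      = "1793-Washington",
  sentence_id = 1L,
  token_id    = 1:3,
  token       = c("Fellow", "citizens", ","),
  pos         = c("ADJ", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Save to a csv file (here in a temporary directory)
out_file <- file.path(tempdir(), "parsed_demo.csv")
write.csv(parsed_demo, out_file, row.names = FALSE)

# Later: load the tagged tokens back without re-running the tagger
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
```

For a big corpus, the same write.csv() call could be run once per text, with one output file per document.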

5.9 Finalize spaCy

While spaCy is running in Python through R, a Python process is always running in the background, and the R session will take up a lot of memory (typically over 1.5GB).

spacy_finalize() terminates the Python process and frees up the memory it was using.


Exercise 5.6 In this exercise, please use the corpus data provided in quanteda.textmodels::data_corpus_moviereviews. This dataset is provided as a corpus object in the package quanteda.textmodels (please install the package on your own). The data_corpus_moviereviews includes 2,000 movie reviews.

  1. Please use spacyr to parse the texts and provide the top 20 adjectives for positive and negative reviews respectively. Adjectives are naively defined as any words whose POS tags start with “J” (please use the fine-grained version of the POS tags, i.e., tag, from spacyr). When computing the word frequencies, please use the lemmas instead of the word forms.

  2. Please provide the top 20 words that are content words for positive and negative reviews ranked by a weighted score, which is computed using the formula provided below. Content words are naively defined as any words whose pos tags start with N, V, or J.

\[Word\;Frequency \times log(\frac{Number\; of \; Documents}{Word\;Dispersion}) \]

  • For example, suppose the lemma action occurs 691 times in the negative reviews collection, these occurrences are scattered across 337 different documents, and there are 1000 negative texts in the current corpus. Then the weighted score for action is:

\[691 \times log(\frac{1000}{337}) = 751.58 \]


In our earlier chapters, we have discussed the issues of word frequencies and their significance in relation to the dispersion of the words in the entire corpus. In terms of identifying important words from a text collection, our assumption is that: if a word is scattered in almost every document in the corpus collection, it is probably less informative. For example, words like a, the would probably be observed in every document in the corpus. Therefore, the high frequencies of these widely-dispersed words may not be as important compared to the high frequencies of those which occur in only a subset of the corpus collection. The word frequency is sometimes referred to as term frequency (tf) in information retrieval; the dispersion of the word is referred to as document frequency (df). In information retrieval, people often use a weighting scheme for word frequencies in order to extract informative words from the text collection. The scheme is as follows:

\[tf \times log(\frac{N}{df}) \]

N refers to the total number of documents in the corpus. The \(log\frac{N}{df}\) is referred to as inverse document frequency (idf). This tf.idf weighting scheme is popular in many practical applications.

The smaller the df of a word, the higher the idf, the larger the weight for its tf.
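The weighting scheme can be written as a one-line R function. The function name weighted_score() is my own, and I assume the natural logarithm (R's default for log()), which reproduces the worked example for the lemma action.

```r
# tf.idf-style weight: term frequency times the log of N over document frequency
# (natural log assumed; tf, df, and n_docs are plain counts)
weighted_score <- function(tf, df, n_docs) {
  tf * log(n_docs / df)
}

# The worked example: 691 occurrences in 337 of 1000 documents
action_score <- weighted_score(tf = 691, df = 337, n_docs = 1000)
round(action_score, 2)   # 751.58

# A word that occurs in every document gets a weight of zero
everywhere_score <- weighted_score(tf = 691, df = 1000, n_docs = 1000)

# The same frequency concentrated in fewer documents gets a larger weight
rare_score <- weighted_score(tf = 691, df = 100, n_docs = 1000)
```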